flowchart TD;
A[Start] --> B[Load Dataset]
B --> C[Data Cleaning]
C --> D[Statistical Analysis]
D --> E[Exploratory Data Analysis]
E --> F[Outlier Detection]
F --> G[Linear Regression Model]
G --> H[Model Evaluation]
H --> I[Conclusion]
Assignment 1
Black Friday Sales Data Analysis
1. Introduction
Black Friday is one of the largest shopping events worldwide, where customers make bulk purchases across different product categories. Understanding customer behavior and product performance is crucial for businesses to optimize marketing strategies and improve sales.
2. Dataset Description
The dataset consists of transactions from a retail store on Black Friday. It includes information about customer demographics, product categories, and purchase amounts.
3. Column Descriptions
| Column Name | Description |
|---|---|
| User_ID | Unique customer identifier |
| Product_ID | Unique product identifier |
| Gender | Customer gender (M/F) |
| Age | Age group of the customer |
| Occupation | Customer occupation category (masked) |
| City_Category | City type (A = Metro, B = Tier 1, C = Tier 2) |
| Marital_Status | Marital status (0 = Single, 1 = Married) |
| Product_Category_1 | Primary product category |
| Purchase | Purchase amount |
Columns Removed: Product_Category_2, Product_Category_3, Stay_In_Current_City_Years (as they are either redundant or not useful for the analysis).
Workflow
Loading Libraries
### Load Required Libraries
library(tidyverse)
library(kableExtra)
library(dplyr)
library(tibble)
library(performance)
library(ggplot2)
library(scales)4. Data Import and Cleaning
1. Loading Dataset
The dataset was imported using the read_csv() function and stored in a dataframe.
# Read the dataset
black_friday <- read_csv("/Users/amitkumar/Desktop/MSc Business Analytics/Sem 2/applied analytics/r_applied_analytics/Assignment_1/dataset/Black_friday_sales_dataset.csv")
# Convert categorical variables to factors
black_friday <- black_friday %>%
mutate(
Gender = as.factor(Gender),
Age = as.factor(Age),
City_Category = as.factor(City_Category),
Occupation = as.factor(Occupation),
Marital_Status = as.factor(Marital_Status),
Product_Category_1 = as.factor(Product_Category_1)
)2. Structure of Dataset
# Check the structure of the dataset
str(black_friday)tibble [550,068 × 12] (S3: tbl_df/tbl/data.frame)
$ User_ID : num [1:550068] 1e+06 1e+06 1e+06 1e+06 1e+06 ...
$ Product_ID : chr [1:550068] "P00069042" "P00248942" "P00087842" "P00085442" ...
$ Gender : Factor w/ 2 levels "F","M": 1 1 1 1 2 2 2 2 2 2 ...
$ Age : Factor w/ 7 levels "0-17","18-25",..: 1 1 1 1 7 3 5 5 5 3 ...
$ Occupation : Factor w/ 21 levels "0","1","2","3",..: 11 11 11 11 17 16 8 8 8 21 ...
$ City_Category : Factor w/ 3 levels "A","B","C": 1 1 1 1 3 1 2 2 2 1 ...
$ Stay_In_Current_City_Years: chr [1:550068] "2" "2" "2" "2" ...
$ Marital_Status : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 2 2 2 ...
$ Product_Category_1 : Factor w/ 20 levels "1","2","3","4",..: 3 1 12 12 8 1 1 1 1 8 ...
$ Product_Category_2 : num [1:550068] NA 6 NA 14 NA 2 8 15 16 NA ...
$ Product_Category_3 : num [1:550068] NA 14 NA NA NA NA 17 NA NA NA ...
$ Purchase : num [1:550068] 8370 15200 1422 1057 7969 ...
3. View the first few rows
# View the first few rows
head(black_friday)# A tibble: 6 × 12
User_ID Product_ID Gender Age Occupation City_Category
<dbl> <chr> <fct> <fct> <fct> <fct>
1 1000001 P00069042 F 0-17 10 A
2 1000001 P00248942 F 0-17 10 A
3 1000001 P00087842 F 0-17 10 A
4 1000001 P00085442 F 0-17 10 A
5 1000002 P00285442 M 55+ 16 C
6 1000003 P00193542 M 26-35 15 A
# ℹ 6 more variables: Stay_In_Current_City_Years <chr>, Marital_Status <fct>,
# Product_Category_1 <fct>, Product_Category_2 <dbl>,
# Product_Category_3 <dbl>, Purchase <dbl>
Getting a random view of dataset for a better understanding
# Getting a random view of dataset for a better understanding
sample_n(black_friday,10)# A tibble: 10 × 12
User_ID Product_ID Gender Age Occupation City_Category
<dbl> <chr> <fct> <fct> <fct> <fct>
1 1003945 P00050342 F 46-50 1 B
2 1002309 P00177142 F 46-50 0 C
3 1002552 P00110842 F 26-35 9 C
4 1002931 P00139942 M 18-25 4 B
5 1001722 P00180342 M 36-45 0 C
6 1005892 P00119242 M 46-50 2 C
7 1001470 P00246442 M 18-25 4 A
8 1004859 P00282642 F 18-25 14 C
9 1002482 P00298742 F 51-55 3 C
10 1002820 P00252442 F 36-45 0 A
# ℹ 6 more variables: Stay_In_Current_City_Years <chr>, Marital_Status <fct>,
# Product_Category_1 <fct>, Product_Category_2 <dbl>,
# Product_Category_3 <dbl>, Purchase <dbl>
4. Removing Unnecessary Columns
Columns such as Stay_In_Current_City_Years, Product_Category_2 and Product_Category_3 were removed as they do not contribute to the analysis.
# Remove unnecessary columns
black_friday <- black_friday %>% select(-c(Product_Category_2, Product_Category_3, Stay_In_Current_City_Years))5. Handling Missing Values
# Check missing values
missing_values <- colSums(is.na(black_friday))
print(missing_values) User_ID Product_ID Gender Age
0 0 0 0
Occupation City_Category Marital_Status Product_Category_1
0 0 0 0
Purchase
0
5. Statistical Analysis
# used chatgpt for code
### 5. Statistical Metrics, as there is only one numeric attribute i.e. Purchase
stats_summary <- tibble(
Statistic = c("Mean", "Median", "Std Dev", "Min", "Max"),
Purchase = c(mean(black_friday$Purchase), median(black_friday$Purchase),
sd(black_friday$Purchase), min(black_friday$Purchase), max(black_friday$Purchase))
)
print(stats_summary)# A tibble: 5 × 2
Statistic Purchase
<chr> <dbl>
1 Mean 9264.
2 Median 8047
3 Std Dev 5023.
4 Min 12
5 Max 23961
6. Exploratory Data Analysis
6.1 The percentage of spending by gender
# Total spending by gender
gender_spending <- black_friday %>%
group_by(Gender) %>%
summarise(Total_Spending = sum(Purchase, na.rm = TRUE))
# Calculate the percentage of total spending
gender_spending <- gender_spending %>%
mutate(Percentage_Spending = (Total_Spending / sum(Total_Spending)) * 100)
# used chatgpt for code
# Plot the percentage of spending by gender
ggplot(gender_spending, aes(x = Gender, y = Percentage_Spending, fill = Gender)) +
geom_bar(stat = "identity", show.legend = FALSE) +
scale_fill_brewer(palette = "Pastel1") +
scale_y_continuous(labels = scales::percent_format(scale = 1)) + # Format as percentage
labs(title = "Percentage of Total Spending by Gender",
x = "Gender",
y = "Percentage of Total Spending") +
theme_minimal()Males contributed significantly more with 78% to total sales than females with 22%.
6.2 Spending by Age Group
# used chatgpt for code
# Total spending by each age group
age_spending <- black_friday %>%
group_by(Age) %>%
summarise(Total_Spending = sum(Purchase))
# Calculate the percentage of total spending
age_spending <- age_spending %>%
mutate(Percentage_Spending = (Total_Spending / sum(Total_Spending)) * 100)
# Plot the percentage of spending by age group
ggplot(age_spending, aes(x = Age, y = Percentage_Spending, fill = Age)) +
geom_bar(stat = "identity", show.legend = FALSE) +
scale_fill_brewer(palette = "Pastel2") +
scale_y_continuous(labels = scales::percent_format(scale = 1)) + # Format as percentage
labs(title = "Percentage of Total Spending by Age Group",
x = "Age Group",
y = "Percentage of Total Spending") +
theme_minimal()Customers in the 26-35 age group had the highest total spending.
Older age groups showed lower spending trends.
6.3 Top 5 Product Categories by Revenue
# Top 5 product categories by revenue
product_revenue <- black_friday %>%
group_by(Product_Category_1) %>%
summarise(Total_Revenue = sum(Purchase)) %>%
arrange(desc(Total_Revenue)) %>%
head(5)
# used chatgpt for code
ggplot(product_revenue, aes(x = reorder(Product_Category_1, Total_Revenue), y = Total_Revenue, fill = Product_Category_1)) +
geom_bar(stat = "identity", show.legend = FALSE) +
scale_fill_brewer(palette = "Pastel2") +
scale_y_continuous(labels = scales::comma) +
labs(title = "Top 5 Product Categories by Revenue", x = "Product Category", y = "Total Revenue") +
coord_flip() +
theme_minimal()7. Outlier Detection & Removal
# Check for outliers using boxplot
boxplot(black_friday$Purchase, main = "Boxplot of Purchase Amounts", col = "lightblue")# Remove outliers using IQR method
Q1 <- quantile(black_friday$Purchase, 0.25)
Q3 <- quantile(black_friday$Purchase, 0.75)
IQR_value <- Q3 - Q1
# Define upper and lower limits
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value
# Filter out outliers (Regular Transactions)
black_friday_clean <- black_friday %>%
filter(Purchase >= lower_bound & Purchase <= upper_bound)
# Identify high-value transactions (Potential Premium Purchases)
extreme_high_values <- black_friday %>%
filter(Purchase > upper_bound)Outliers were detected using the IQR method.
Transactions exceeding the upper bound were considered high-value transactions.
A boxplot was used to visualize these high-value purchases.
# Show how many rows were removed
paste("Rows before cleaning:", nrow(black_friday))[1] "Rows before cleaning: 550068"
paste("Rows after cleaning:", nrow(black_friday_clean))[1] "Rows after cleaning: 547391"
paste("Rows removed:", nrow(black_friday) - nrow(black_friday_clean))[1] "Rows removed: 2677"
# Display top high-value transactions
# Potential premium purchases
head(extreme_high_values)# A tibble: 6 × 9
User_ID Product_ID Gender Age Occupation City_Category Marital_Status
<dbl> <chr> <fct> <fct> <fct> <fct> <fct>
1 1000058 P00117642 M 26-35 2 B 0
2 1000062 P00119342 F 36-45 3 A 0
3 1000126 P00087042 M 18-25 9 B 0
4 1000139 P00159542 F 26-35 20 C 0
5 1000175 P00052842 F 26-35 2 B 0
6 1000235 P00116142 M 26-35 0 B 0
# ℹ 2 more variables: Product_Category_1 <fct>, Purchase <dbl>
8. Predictive Modeling: Linear Regression
1. Selecting Features
Independent variables: Gender, Age, and City_Category.
The categorical variables were converted into numeric form.
# used chatgpt for code
# Convert categorical variables to numeric factors
black_friday <- black_friday %>%
mutate(
Gender = as.numeric(as.factor(Gender)), # Male = 1, Female = 2
Age = as.numeric(as.factor(Age)), # Convert Age groups to numbers
City_Category = as.numeric(as.factor(City_Category)) # Convert City A/B/C to numbers
)
# Build linear regression model
linear_model <- lm(Purchase ~ Gender + Age + City_Category, data = black_friday)
# Display model summary
summary(linear_model)
Call:
lm(formula = Purchase ~ Gender + Age + City_Category, data = black_friday)
Residuals:
Min 1Q Median 3Q Max
-9925 -3525 -1127 2923 15653
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7088.902 36.997 191.609 < 2e-16 ***
Gender 706.601 15.647 45.159 < 2e-16 ***
Age 31.812 5.023 6.334 2.39e-10 ***
City_Category 403.933 8.943 45.166 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5004 on 550064 degrees of freedom
Multiple R-squared: 0.007581, Adjusted R-squared: 0.007576
F-statistic: 1401 on 3 and 550064 DF, p-value: < 2.2e-16
The model tries to predict purchase amount based on Gender, Age, and City Category.
It is statistically significant but explains less than 1% of the variation in purchases, meaning it does not predict well.
9. Conclusion
This analysis of Black Friday Sales data helped us understand customer spending habits. We also identified high-value purchases using outlier detection and built a Linear Regression model to predict spending. These insights can help businesses target the right customers, improve marketing, and manage inventory better. Future improvements could include using advanced machine learning models to make better predictions.